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Abstract 

High-dimensional data common in genomics, proteomics, and chemometrics often 
contains complicated correlation structures. Recently, partial least squares (PLS) and 
Sparse PLS methods have gained attention in these areas as dimension reduction tech- 
niques in the context of supervised data analysis. We introduce a framework for 
Regularized PLS by solving a relaxation of the SIMPLS optimization problem with 
penalties on the PLS loadings vectors. Our approach enjoys many advantages includ- 
ing flexibility, general penalties, easy interpretation of results, and fast computation 
in high-dimensional settings. We also outline extensions of our methods leading to 
novel methods for Non-negative PLS and Generalized PLS, an adaption of PLS for 
structured data. We demonstrate the utility of our methods through simulations and 
a case study on proton Nuclear Magnetic Resonance (NMR) spectroscopy data. 
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1 Introduction 



Technologies to measure high-throughput biomedical data in proteomics, chemometrics, and 
genomics have led to a proliferation of high-dimensional data that pose many statistical 
challenges. As genes, proteins, and metabolites, are biologically interconnected, the variables 
in these data sets are often highly correlated. In this context, several have recently advocated 
using partial least squares (PLS) for dimension reduction of supervised data, or data with 



a response or labels (Nguyen and Rocke, 2002b Boulesteix and Strimmer, 2007 Rossouw 



et al. 


2008 


Chun and Kele§ 


2010 


) . First introduced by 


Wold 


(1966) 



that uses least squares on a set of derived inputs accounting for multi-colinearities, others 



have since proposed alternative methods for PLS with multiple responses (de Jong, 1993) 



and for classification (Marx 1996 Barker and Rayens 2003). More generally, PLS can 
be interpreted as a dimension reduction technique that finds projections of the data that 
maximize the covariance between the data and the response. Recently, several have proposed 
to encourage sparsity in these projections, or loadings vectors, to select relevant features in 



high-dimensional data (Rossouw et al. , 2008 Chun and Kele§, 2010). In this paper, we seek a 
more general and flexible framework for regularizing the PLS loadings that is computationally 
efficient for high- dimensional data. 

There are several motivations for regularizing the PLS loadings vectors. Partial least 
squares is closely related to principal components analysis (PC A); namely, the PLS loadings 
can be computed by solving a generalized eigenvalue problem ( |de Jong 1993). Several have 
shown that the PCA projection vectors are asymptotically inconsistent in high-dimensional 



settings (Johnstone and Lu 2009 Jung and Marron 2009). This is also the case for the PLS 



loadings, recently shown in Nadler and Coifman (2005) and Chun and Kele§ (2010). For 
PCA, encouraging sparsity in the loadings has been shown to yield consistent projections 



(Johnstone and Lu, 2009 Amini and Wainwright 2009). While an analogous result has not 
yet been shown in the context of PLS, one could surmise that such a result could be attained. 



In fact, this is the motivation for Chun and Kele§ (2010 )'s recent Sparse PLS method. In 
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addition to consistency motivations, sparsity has many other quahties to recommend it. 
The PLS loadings vectors can be used as a data compression technique when making future 
predictions; sparsity further compresses the data. As many variables in high-dimensional 
data are noisy and irrelevant, sparsity gives a method for automatic feature selection. This 
leads to results that are easier to interpret and visualize. 

While sparsity in PLS is important for high-dimensional data, there is also a need for 
more general and flexible regularized methods. Consider NMR spectroscopy as a motivating 
example. This high-throughput data measures the spectrum of chemical resonances of all 



the latent metabolites, or small molecules, present in a biological sample (Nicholson and Lin- 



don, 2008). Typical experimental data consists of discretized, functional, and non-negative 



spectra with variables measuring in the thousands for only a small number of samples. 
Additionally, variables in the spectra have complex dependencies arising from correlation 
at adjacent chemical shifts, metabolites resonating at more than one chemical shift, and 



overlapping resonances of latent metabolites (De Graaf 2007). Because of these complex de- 



pendencies, there is a long history of using PLS to reduce the NMR spectrum for supervised 



data (Goodacre et al. , 2004 Dunn et al. , 2005). Classical PLS or Sparse PLS, however, are 



not optimal for this data as they do not account for the non-negativity or functional nature 
of the spectra. In this paper, we seek a more flexible approach to regularizing PLS load- 
ings that will permit (i) general penalties such as to encourage sparsity, group sparsity, or 
smoothness, (ii) constraints such as non-negativity, and (iii) directly account for known data 
structures such as ordered chemical shifts for NMR spectroscopy. Our framework, based on 



a penalized relaxation of the SIMPLS optimization problem (de Jong 1993), also leads to a 
more computationally efficient numerical algorithm. 

As we have mentioned, there has been previous work on penalizing the PLS loadings. 



For functional data, Goutis and Fearn (1996) and Reiss and Ogden (2007) have extended 



PLS to encourage smoothness by adding smoothing penalties. Our approach is more closely 



related to the Sparse PLS methods of Rossouw et al. (2008) and Chun and Kele§ (2010). In 



the latter, a generalized eigenvalue problem related to PLS objectives is penalized to achieve 



sparsity, although they solve an approximation to this problem via the elastic net Sparse 



PCA approach of Zou et al. (2006). Noting that PLS can be interpreted as performing PCA 



on the deflated cross-products matrix, Rossouw et al. (2008) replace PCA with Sparse PCA 



using the approach of Shen and Huang (2008). We choose to adopt a more direct approach. 
Our Sparse PLS method, instead, penalizes a generalized SVD problem directly with an 
£i-norm penalty that is a concave relaxation of the SIMPLS criterion; our method, then, is 



more closely related to the Sparse PCA approaches of Witten et al. (2009) and Allen et al. 



(2011). We will show that this more direct framework has numerous advantages includ- 
ing generalizations permitting various penalties that are norms, non-negativity constraints, 
generalizations for structured data, greater algorithmic flexibility, and fast computational 
approaches for high-dimensional data. 

The paper is organized as follows. Our framework for Regularized Partial Least Squares 
(RPLS) is introduced in Section [2| In Section [sj we introduce two novel extensions of PLS 
and RPLS: Non-negative PLS and Generalized PLS for structured data. We illustrate the 
comparative strengths of our approach in Sections |4] and |5] through simulation studies and 
a case study on NMR spectroscopy data, respectively, and conclude with a discussion in 
Section |6l 



2 Regularized Partial Least Squares 

In this section, we introduce our framework for regularized partial least squares. While most 
think of PLS as a regression technique, here we separate the steps of the PLS approach 
into the dimension reduction stage where the PLS loadings and factors are computed and 
a prediction stage where regression or classification using the PLS factors as predictors is 
performed. As our contributions lie in our framework for regularizing the PLS loadings in the 
dimension reduction stage, we focus on this in the first three subsections, and then discuss 



considerations for regression and classification problems in Section 2.4 
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2.1 RPLS Optimization Problem 

Introducing notation, we observe data (predictors), X G 3?"^'', with p variables measured 
on n samples and a response Y G 3?"^''. We will assume that the columns of X have been 
previously standardized. The possibly multivariate response (g > 1) could be continuous as 



in regression or encoded by dummy variables to indicate classes as in Barker and Rayens 



(2003), a consideration which we ignore while developing our methodology. The px q sample 
cross-product matrix is denoted as M = X"^ Y. 

Both of the two major algorithms for computing the multivariate PLS factors, NIPALS 



(Wold, 1966) and SIMPLS (de Jong, 1993), can be written as solving a single-factor eigen- 
value problem of the following form at each step: maximizcv v^MM^v subject to v^v = 



1, where v E W are the PLS loadings. Chun and Kele§ (2010) extend this problem by adding 



an £i-norm constraint, || v || < t, to induce sparsity and solve an approximation to this prob- 



lem using the Sparse PC A method of Zou et al. (2006). Rossouw et al. (2008) replace this 



optimization problem with that of the Sparse PCA approach of Shen and Huang (2008) 



We take a simpler and more direct approach. Notice that the single factor PLS problem 
can be re-written as the following: maximizev,u v-'^Mu subject to v"^ v = 1 & u"^ u = 1, 
where u G 3?'' is a nuisance parameter. The equivalence of these problems was pointed out 



m 



de Jong (1993) and is a well understood matrix analysis fact (Horn and Johnson, 1985). 



Our single-factor RPLS problem penalizes a direct concave relaxation of this problem: 



maximize v^Mu— AP(v) subject to v^v < 1 & u^u = 1. 



Here, we assume that P{) is a convex penalty function that is a norm or semi-norm; these 
assumptions are discussed further in the subsequent section. To induce sparsity, for example, 
we can take -P(v) = || v ||i. Notice that we have relaxed the equality constraint for v to 
an inequality constraint. In doing so, we arrive at an optimization problem that is simple 
to maximize via an alternating strategy. Fixing u, the problem in v is concave, and fixing 
V the problem is a quadratically constrained linear program in u with a global solution. 



Our approach is most closely related to some recent direct bi-concave relaxations for two- 



way penalized matrix factorizations (Witten et al. , 2009 Allen et al. , 2011). Studying the 



solution to this problem and its properties in the subsequent section will reveal some of the 
major advantages of this optimization approach. 

Computing the multi-factor PLS solution via the two traditional multivariate approaches, 
SIMPLS and NIPALS, require solving optimization problems of the same form as the single- 
factor PLS problem at each step. The SIMPLS method is more direct and has several 
benefits within our framework; thus, this is the approach we adopt. The algorithm begins 
by solving the single-factor PLS problem; subsequent factors solve the single-factor problem 
for a Gram-Schmidt deflated cross-products matrix. If we let the matrix of projection weights 
Rfc G 3?^^'^ be defined recursively then, = [Rfc_i X'^z^ / zj^ z^] where = Xv^ is the 



fc*'^ sample PLS factor. The Gram-Schmidt projection matrix G 3?^^^ is given by P 



k 



I — Rfc(R^ Rfc) ^ Rfc, which ensures that X"^ X Vj = for j < k. Then, the optimization 
problem to find the k^^ SIMPLS loadings vector is the same as the single-factor problem 
with the cross-products matrix, M, replaced by the deflated matrix, M = Pk~i M 



(de Jong, 1993). Thus, our multi-factor RPLS replaces M in ([T]) with M to obtain the k 



th 



RPLS factor. 

The deflation approach employed via the NIPALS algorithm is not as direct. One typi- 
cally defines a deflated matrix of predictors and responses, Xjt = X(I — Rfc ) and Y^, = 
Y(I — VfeR^), with the matrix of projection weights defined as above, and then solves an 
eigenvalue problem in this deflated space: maximizewj. w^X^ Y^Y^X^Wfe subject to w^^w^ 



1 ( fWoIdl [l966| ). The PLS loadings in the original space are then recovered by 



Wfc(RfcWfc) ^. While one can incorporate regularization into the loadings, Wk (as sug- 



gested in Chun and Kele§ (2010)), this is not as desirable. If one estimates sparse deflated 
loadings, w, then much of the sparsity will be lost in the transform to obtain V. In fact, 
the elements of V will be zero if and only if the corresponding elements of W are zero for 
all values of k. Then, each of the K PLS loadings will have the exact same sparsity pattern, 
loosing the flexibility of each set of loadings having adaptively different levels of sparsity. 
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Given this, the more direct deflation approach of SIMPLS is our preferred framework. 



2.2 RPLS Solution 

A major motivation for our optimization framework for RPLS is that it leads to a simple and 
direct solution and algorithm. Recall that the single-factor RPLS problem, ([T]), is concave 
in V with u fixed and is a quadratically constrained linear program in u with v fixed. Thus, 
wc propose to solve this problem by alternating maximizing with respect to v and u. Each 
of these maximizations has a simple analytical solution: 

Proposition 1. Assume that P{) is convex and homogeneous of order one, that is P{) is 
a norm or semi-norm. Let u be fixed at u' such that M u' 7^ or v fixed at v' such that 
M"^v' 7^ 0. Then, the coordinate updates, u* and v* , maximizing the single-factor RPLS 
problem, ([T]), are given by the following: Let v = argmin^{^|| Mu' — v |p — AP(v)}. Then, 
V* = v/||v||2 if ||v||2 > and v* = otherwise, and u* = v' /|| v' II2. When these 
factors are updated iteratively, they monotonically increase the objective and converge to a 
local optimum. 



While the full proof of this result is given in the appendix, we note that this follows closely 



the Sparse PGA approach of Witten et al. (2009) and the use general penalties within PGA 



problems of Allen et al. (2011). Our RPLS problem can then be solved by a multiplicative 



update for u and by a simple re-scaled penalized regression problem for v. The assumption 
that P{) is a norm or semi-norm encompasses many penalties types including the £i-norm 



or lasso (Tibshirani, 1996) and the £i/£2-norm or group lasso (Yuan and Lin, 2006), fused 



lasso Tibshirani et al. (2005). For many possible penalty types, there exists a simple solution 
to the penalized regression problem. With a lasso penalty, P(v) = || v ||i, for example, the 
solution is given by soft-thresholding: v = ^(M u. A), where ^(x. A) = sign(a;)(|a;|— A)+ is the 
soft-thresholding operator. Our approach gives a more general framework for incorporating 
regularization directly in the PLS loadings that yield simple and computationally attractive 
solutions. 
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Algorithm 1 fC-Factor Regularized PLS 



r(l) 



1. Center the columns of X and Y. Let M 

2. For k = l...K: 

(a) Initialize and to the first left and right singular vectors of M' 

(b) Repeat until convergence: 

i. Set Uk 



||(M-0^v,||2 

ii. Set Vfc = argminy/_ s | |M — v'^ 1 12 — XkP(y[ 

iii. Set Vfc = Vfc/||vfc||2 if ||vfc||2 > 0, and set = and exit the algorithm 
otherwise. 

(c) RPLS Factor: = Xv^. 

(d) RPLS projection matrix: Set R^'^^ = [R^^"^) X^ / z^ z^] and = I - 



Ak) 



(e) Orthogonalization Step: M ' ' =PfcM' 
3. Return RPLS Factors zi . . .zk and RPLS Loadings: vi . . . v^- 



We note that the RPLS solution is guaranteed be at most a local optimum of Q, a result 



that is typical of other penalized PCA problems (Zou et al. , 2006 Shen and Huangl 2008 



Witten et al. 2009 Lee et al. 2010; Allen et al. 2011) and sparse PLS methods (Rossouw 



et al. , 2008 Chun and Kele§, 2010). For a special case, however, our problem has a global 



solution: 

Corollary 1. When q = 1, that is when Y is univariate, then the global solution to the single- 
factor penalized PLS problem ([T]) is given by the following: Let v = argmin^j 1 1 1 M — v | p — 
AP(v)}. Then, v* =v/||v||2 «/||v||2 >0 and \* =0 otherwise. 

This, then is an important advantage of our framework over competing methods. 



2.3 RPLS Algorithm 

Given our RPLS optimization framework and solution, we now put these together in the 
RPLS algorithm. Algorithm [Tj Note that this algorithm is a direct extension of the SIMPLS 



8 



algorithm (de Jong, 1993), where the solution to our single- factor RPLS problem, ([T]), re- 
places the typical eigenvalue problem in Step 2 (b). Since our RPLS problem is non-concave, 
there are potentially many local solutions and thus the initializations of u and v are impor- 



tant. Similar to much of the Sparse PCA literature (Zou et al. , 2006 Shen and Huang, 2008) 



we recommend initializing these factors to the global single-factor SVD solution. Step 2 (a). 
Second, notice that choice of the regularization parameter. A, is particularly important. If 
A is large enough that = 0, then the /c*'* RPLS factor would be zero and the algorithm 
would cease. Thus, care is needed when selecting the regularization parameters to ensure 



they remain within the relevant range. For the special case where q = 1, computing A 



(fc) 

max J 



the value at which = 0, is a straightforward calculation following from the Karush-Khun- 



Tucker conditions. With the LASSO penalty, for example, this gives A 



(fc) 



max,- M • 



(Friedman et al. , 2010). For general q, however, Xmax does not have a closed form. While 



one could use numerical solvers to find this value, this is a needless computational effort. 
Instead, we recommend to perform the algorithm over a range of A values, discarding any 
values resulting in a degenerate solution from consideration. Finally, unlike deflation-based 
Sparse PCA methods which can exhibit poor behavior for very sparse solutions, due to or- 
thogonalization with respect to the data, our RPLS loadings and factors are well behaved 
with large regularization parameters. 

Selecting the appropriate regularization parameter. A, is an important practical con- 
sideration. Existing methods that incorporate regularization in PLS have suggested using 
cross-validation or other model selection methods in the ultimate regression or classification 



stage of the full PLS procedure (Reiss and Ogden 2007 Chun and Kele§ 2010). While one 



could certainly implement these approaches within our RPLS framework, we suggest a sim- 
pler and more direct approach. We select A within the dimension reduction stage of RPLS, 
specifically in Step 2 (b) of our RPLS Algorithm. Doing so, has a number of advantages. 
First, this increases flexibility as it separates selection of A from deciding how many factors, 
K, to use in the prediction stage, permitting a separate regularization parameter, A^, to be 
selected for each RPLS factor. Second, coupling selection of the regularization parameter to 
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the prediction stage requires fixing the supervised modehng method before computing the 
RPLS factors. With our approach, the RPLS factors can be computed and stored to use as 
predictors in a variety of modehng procedures. Finally, separating selection of and K in 
the prediction stage is computationally advantageous as a grid search over tuning parame- 
ters is avoided. Nesting selection of A within Step 2 (b) is also faster as recent developments 
such as warm starts and active set learning can be used to efficiently fit the entire path of 



solutions for many penalty types (Friedman et al. , 2010). Practically, selecting Afc within the 
dimension reduction stage is analogous to selecting the regularization parameters for Sparse 



^ (fc) I 
PCA methods on M . Many approaches including cross-validation (Shen and Huang 



2008 



Owen and Perry, 2009) and BIG methods (Lee et ah 2010 Allen et al. , 2011) have been 



suggested for this purpose; in results given in this paper, we have implemented the BIG 



method as described in Allen et al. (2011). Selection of the number of RPLS factors, 
will largely be dependent on the supervised method used in the prediction stage, although 
cross-validation can be used with an method. 

Gomputationally, our algorithm is an efficient approach. As discussed in the previous 
section, the particular computational requirements for computing the RPLS loadings in 
Step 2 (b) are penalty specific, but are minimal for a wide class of commonly used penalties. 
Beyond Step 2 (b), the major computational requirement is inverting the weight matrix, 
R^Rfc, to compute the projection matrix. Since this matrix is found recursively via the 
Gram-Schmidt scheme, however, employing properties of the Schur complement can reduce 



the computational effort to that of matrix multiplication 0{pk) (Horn and Johnson, 1985). 
Finally, notice that we take the RPLS factors to be the direct projection of the data by 
the RPLS loadings. Overall, the advantages of our RPLS framework and algorithm include 
(1) computational efficiency, (2) flexible modeling, and (3) direct estimation of the RPLS 
loadings and factors. 
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2.4 RPLS for Regression and Classification 

While many think of PLS as a single approach to regression, we have separated the dimension 
reduction stage from the prediction stage where the PLS factors, Z, replace the original 



predictors. As many have advocated using PC A, or even supervised PCA (Bair et al. , 2006), 
as a dimension reduction technique prior supervised modeling, RPLS may be a powerful 
alternative in this context. While studying the behavior of our RPLS method for particular 
supervised techniques is beyond the scope of this paper, we outline here some considerations 
for using our framework in common regression and classification problems. 

Applying our RPLS framework in regression problems where Y encodes the response, 
is straightforward. For univariate responses, our framework has the added benefit that 
each RPLS loadings vector is the global solution to the underlying penahzed optimization 
problem. With traditional PLS regression, there is an interesting connection between Krylov 
sequences and the PLS regression coefficients, namely the latter are the minimum to a least 
squares problem constrained so that the coefficients lie within the Krylov subspace spanned 



by {Y,XX' Y, (XX^)2 Y, . . . , (XX^)^*^' Y} (iKramerl l2007|) . As we take the RPLS factors 



to be a direct projection of the RPLS loadings, this connection to Krylov sequences is broken, 
although perhaps for prediction purposes, this is immaterial. 

While in the context of regression, traditional approaches such as cross-validation can be 
used to find the number of RPLS factors, K < n, an approach suggested by [Huang et al. 



(2004) may have added benefits for large data sets. They propose to post-select the number of 
factors by adding a sparse penalty, minimizing the following criterion: ||Y — Z/3||2 + 7||/3||i. 
As the PLS or RPLS factors are orthogonal, there is a simple solution for the coefficients 
that automatically selects the number of factors, (3 = S{7i^ Y, 7). In our simulation study in 
Section |4| we use this penalization approach to automatically selecting the number of RPLS 
factors. Also we note that a recent paper directly computes the degrees of freedom for PLS 



regression that can be used for model selection with BIC and AIC methods (Kramer and 



Sugiyama 20111 ). As the relationship between Krylov sequences and our RPLS factors no 
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longer holds, however, this approach cannot be directly employed with our methods. 

Many have suggested using the PLS factors for classification by coding the response as 



dummy variables indicating the classes for discriminant analysis (Barker and Rayens, 2003) 



or by using the exponential family links for generalized linear models (Marx, 1996 Chung 



and Keles, 2010), approaches that can be used in conjunction with RPLS. Interestingly, 



Barker and Rayens (2003) have shown that coding the response with dummy variables scaled 



according the the class size yields PLS loadings vectors that are a scaled version of Fisher's 
discriminant vectors. Thus, our RPLS framework may lead to an alternative formulation for 
sparse or regularized LDA, a connection which we leave to future work to explore. Finally, 
while again cross-validation approaches can be used to select the number of RPLS factors 
for classification, it is common to compute a number of discriminant vectors equal to the 
number of classes. For our case study in Section |5| this is the approach we adopt. 



3 Extensions 

As our framework for regularizing PLS is general, there are many possible extensions of 
our methodology. We focus here on two novel extensions of PLS and RPLS that will be 
particularly useful for understanding spectroscopy data. These include generalizations for 
PLS and RPLS with structured data and non-negative PLS and RPLS. 



3.1 Generalized PLS for Structured Data 



Recently, Allen et al. (2011 ) proposed a generalization of PCA (GPCA) that is a appropriate 



for high-dimensional structured data, or data in which the variables are associated with some 
known distance metric. As motivation, consider NMR spectroscopy data where variables are 
ordered on the spectrum and variables at adjacent chemical shifts are known to be highly 
correlated. Classical multivariate techniques such as PCA and PLS ignore these structures; 
GPCA encodes structure into a matrix factorization problem through positive semi-definite 



quadratic operators such as Laplacians or kernel smoothers (Allen et al. , 2011 Allen and 
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Maletic-Savatic , 2011). Similar to GPCA, we seek to directly account for known structure 
in PLS and within our RPLS framework. 

Let us define the quadratic operator, Q G W^^^ : Q ^ 0, that encodes the known 
structural relationships between variables. By transforming all inner-product spaces to those 
induced by the Q-norm, we can define our single-factor Generalized RPLS optimization 
problem in the following manner: 

maximize v"^QMu— AP(v) subject to v^Qv<l, & u"^u = l. (2) 

v,u 

For the multi-factor Generalized RPLS problem, the factors and projection matrices are 
also changed. The k^^ factor is given by = XQv^, the weighting matrix, = 
[Rfc-i X"^ Zfc / zl Zfc] as before, and the projection matrix is = I — R^(R^ Q Rfc)^"*^ R^ . 
The deflated cross-products matrix is then given by M = Pfc_i Q M . Note that if 
A = and if the inequality constraint is forced to be an equality constraint, then we have the 
optimization problem for Generalized PLS. Notice also that instead of enforcing orthogonal- 
ity of the PLS loadings with respect to the data, v|^X"^Xvj, the Generalized PLS problem 
enforces orthogonality in a projected data space, v^QX^^XQvj. If we let Q be a matrix 
square root of Q as defined in Allen et al. (2011), then ^ is equivalent to the multi-factor 



RPLS problem for X = X Q and v = Q v. This equivalence is shown in the proof of the 
solution to (j2]). 

As with PLS and our RPLS framework. Generalized PLS and RPLS can be solved by 
coordinate-wise updates that converge to the global and local optimum respectively: 

Proposition 2. 1. Generalized PLS: The Generalized PLS problem, (|2| when \ = 0, is 
solved by the first set of GPGA factors o/M. The global solution to the Generalized 
PLS problem can be found by iteratively updating the following until convergence: v = 



M u /I I M u IIq and u = M"^ Q v /| | M"^ Q v 1 12, where |q is defined as a/x"^ Q x. 

2. Generalized RPLS: Under the assumptions of Proposition\^ let 

V = argmin^,{|| Mu — v' IIq + AP(v')}, then the coordinate- wise updates to ^ 
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are 



given by: v* = v/||v||q if ||v||q > and v* = otherwise, and with u* defined as 
above. When updated iteratively, these converge to a local optimum of 

Thus, the solution to our Generahzed RPLS problem can be solved by a generalized 
penalized least squares problem. Algorithmically, solving the multi-factor Generalized PLS 
and RPLS problems follow the same structure as that of Algorithm [TJ The solutions out- 
lined above replace Step 2 (b), with the altered Generalized RPLS factors and projections 
matrices replacing Steps 2 (c), (d), and (e). In other words, Generalized PLS or RPLS is 
performed by finding the GPCA or Regularized GPCA factors of a deflated cross-products 
matrix, where the deflation is performed to rotate the cross-products matrix so that it is 
orthogonal to the data in the Q-norm. Computationally, these algorithms can be performed 



efficiently using the techniques described in Allen et al. (2011) that do not require inversion 



or taking eigenvalue decompositions of Q. Thus, the Generalized PLS and RPLS methods 
are computationally feasible for high-dimensional data sets. 

We have shown the most basic extension of GPCA technology to PLS and our RPLS 
framework, but there are other possible formulations. For two-way data, projections in the 
"sample" space may be appropriate in addition to projecting variables in the Q-norm. With 
neuroimaging data, for example, the data matrix may be oriented as brain locations, voxels, 
by time points. As the time series are most certainly not independent, one may wish to 
transform these inner product spaces using another quadratic operator, W G 3?"^", changing 



M to X'^ W Y and Rfc to Rfe = [Rfc„i W z^. / W zj, analogous to |Allen et al] ( |201l[ ). 
Overall, we have outlined a novel extension of PLS and our RPLS methodologies to work 
with high-dimensional structured data. 

3.2 Non-Negative PLS 



Many have advocated estimating non-negative matrix factors (Lee and Seung, 1999) and 



non-negative principal component loadings (Hoyer, 2004) as a way to increase interpret 



ability of multivariate methods. For scientific data sets such as NMR spectroscopy in which 
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variables are naturally non-negative, enforcing non-negativity of the loadings vectors can 



greatly improve interpretability results and the performance of methods (Allen and Maletic 



Savatic, 2011). Here, we illustrate how to incorporate non-negative loadings into our RPLS 



framework. Consider the optimization problem for single-factor Non-negative RPLS: 

maximize v"^ M u — AP(v) subject to v"^ v < 1, u"^ u = 1 & v > 0. (3) 

v,u 

Solving this optimization problem is a simple adaption of Proposition [T] the penalized regres- 
sion problem is replaced by a penalized non-negative regression problem. For many penalty 
types, these problems have a simple solution. With the £i-norm penalty, for example, the 
soft-thresholding operator in the update for v is replaced by the positive soft-thresholding 



operator: v = P(Mu, A) = (Mu-A)+ (Allen and Maletic-Savatic 2011). Our RPLS 



framework, then, gives a simple and computationally efficient method for enforcing non- 



negativity in the PLS loadings. Also, as in Allen and Maletic-Savatic (2011), non-negativity 



and quadratic operators can be used in combination for PLS to create flexible approaches 
for high-dimensional data sets. 

4 Simulation Studies 

We explore the performance of our RPLS methods for regression in a univariate and a 
multivariate simulation study. 

4.1 Univariate Simulation 

In this simulation setting, we compare the mean squared prediction error and variable selec- 
tion performance of RPLS against competing methods in the univariate regression response 



setting with correlated predictors. Following the approach in Section 5.3 of Chun and Kele§ 



(2010), we include scenarios where n is greater than p and where n is less than p with dif- 



fering levels of noise. For the n > p setting, we use n = 400 and p = 40; for the n < p 
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setting, we use n = 40 and p = 80. In each case, 75% of the p predictors are true predictors, 
while the remaining 25% are spurious predictors that are not used in the generation of the 
response. For the low and high noise scenarios, we use signal-to-nose ratios (SNR) of 10 and 
5. 



To create correlated predictors as in Chun and Kele§ (2010), we construct hidden variables 
Hi, ... , H^, where Hi ~ A/'(0, 25I„). The columns of the predictor matrix Xj are generated 
as the sum of a hidden variable and independent random noise as follows: Xi = Hi + Si 
for 1 < i < 3p/8, Xi = H2 + Si for 3p/8 < i < 3p/4, and Xi = H3 + Si for 3p/4 < i < p, 
where Ei ~ Af{0, I„). The response vector Y = 3Hi - AH2 + /, where / ~ A^(0, 25I„/SNR). 
Training and test sets for all settings of n, p and SNR are created using this approach. 

For the comparison of methods, X and Y are standardized, and parameter selection is 
carried out using 10-fold cross validation on the training data. For the sparse partial least 



squares (SPLS) method described in Chun and Kele§ (2010), the spls R package (Chung 



et aH 2012) is used with 77 chosen from the sequence (0.1,0.2, . . . ,0.9) and K from 5 to 



10. Note that for our methods, we choose to select K automatically via the lasso penalized 



PLS regression problem described in Section 2.4 with penalty parameter 7. Thus for RPLS, 
lasso penalties were used with A and 7 chosen from 25 equally spaced values between 10~^ 
and log(max(|X'y I)) on the log scale. For the lasso and elastic net, the glmnet R package 



(Friedman et al. , 2010) is used with the same choices for A. 

The average mean squared prediction error (MSPE), true positive rate (TPR), and false 
positive rate (FPR) across 30 simulation runs are given in Table [l} The penalized regression 
methods clearly outperform traditional PLS in terms of the mean squared prediction error, 
with RPLS having the best prediction accuracy among all methods. SPLS and RPLS are 
nearly perfect in correctly identifying the true variables, but SPLS tends to have higher 
rates of false positives. In contrast, the lasso and elastic net have high specificity, but fail to 
identify many true predictors. 



16 



Simulation 1: n = 400, p 


A r\ TV TT^ 

= 40, SNR 


= 10 


Simulation 2: n = 400, p 


= 40, SNR 


= 5 


Method 


MSPE (bE) 


IPR (oE) 


I'PR (oEj 


Method 


MbPE (SE) 


IPR (SE) 


i'PR (SE) 


r Lib 


0U4.Z 






r Lib 


^^^^ 








(293.8) 








(212.9) 






sparse ruD 


70 R 


1.00 


0.61 


bparse rLb 


1 /I Q 7 


1.00 


0.66 




(4.1) 


(0.00) 


(0.27) 




(9.8) 


(0.00) 


(0.29) 


RPLS 


66.4 


1.00 


0.22 


RPLS 


131.4 


1.00 


0.19 




(3.8) 


(0.00) 


(0.35) 




(9.3) 


(0.00) 


(0.37) 


Lasso 


70.9 


0.60 


0.00 


Lasso 


139.3 


0.49 


0.00 




(4.9) 


(0.07) 


(0.02) 




(9.5) 


(0.07) 


(0.00) 


Elastic net 


70.5 


0.61 


0.01 


Elastic net 


139.0 


0.50 


0.00 




(4.5) 


(0.07) 


(0.03) 




(9.5) 


(0.07) 


(0.00) 


Simulation 3: n = 40, p = 


= 80, SNR = 


= 10 


Simulation 


4. 11 — 4U, p - 


= 80, SNR = 


= 5 


Method 


MSPE (SE) 


IPR (SE) 


IPR (SE) 


Method 


MSPE (SE) 


IPR (SE) 


tPR (SE) 


PLS 


624.1 






PT Q 
r Lb 


DIZ.D 








(256.5) 








(256.8) 






Sparse PLS 


104.9 


0.99 


0.77 


bparse r Lb 


ZUD.4 


0.98 


0.70 




(26.3) 


(0.05) 


(0.30) 




(53.9) 


( C\ C\'l\ 

(0.07) 


(0.31) 


RPLS 


76.0 


1.00 


0.45 


PPT Q 
xlr Lb 


1 1 

100. i 


1.00 


0.52 




(20.8) 


(0.00) 


(0.43) 




(59.0) 


( C\ (\C\\ 

(0.00) 


(0.43) 


Lasso 


83.7 


0.17 


0.02 


Lasso 


178.3 


0.12 


0.01 




(19.7) 


(0.04) 


(0.06) 




(49.7) 


(0.04) 


(0.02) 


Elastic net 


82.4 


0.17 


0.02 


Elastic net 


172.7 


0.12 


0.01 




(18.6) 


(0.03) 


(0.04) 




(46.0) 


(0.04) 


(0.03) 



Table 1: Comparison of mean squared prediction error (MSPE), true positive rate(TPR) and 
false positive rate (FPR) with standard errors (SE). 

4.2 Multivariate Simulation 

In this simulation setting, we compare the mean squared prediction error of regularized PLS 
against competing methods for multivariate regression. As in the univariate simulation, we 
include scenarios where n > p and n < p with varying levels of noise, but now our response 
Y is a matrix of dimension nx q with q — 10. For the n > p scenario, we use n — 400 and 
p = 40 with 5 true predictors. For the n < p scenario, we use n = 40 and p = 80 with 10 
true predictors. In each case, we test the methods using signal to noise ratios (SNR) of 2 
and 1. 

The simulated data is generated using 8 binary hidden variables Hi, . . . , with entries 
drawn from the Bernoulli(0.5) distribution. The coefficient matrix A contains standard nor- 
mal random entries for the first ptrue columns, with the remaining columns set to 0. The 
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predictor matrix X = H • A + E, where the entries of E are drawn from the A/'(0, 0. 1^) distri- 
bution. The coefficient matrix B contains entries drawn from the Af{0, SNR ■ n ■ g/tr(HH')) 
distribution. The response matrix Y = H ■ B + F, where the entries of F are drawn from the 
standard normal distribution. Both training and test sets are generated using this procedure, 
and both X and Y are standardized. 

For the penahzed methods including sparse PCA (SPCA) and regularized PLS (RPLS) 
the penalty parameter A is chosen from 25 equally spaced values between —5 and log(max(|X'Y|)) 
on the log scale using the BIG criterion. For RPLS, 7, the PLS regression penalty param- 
eter for selecting K, is chosen from the same set of options as A using the BIG criterion. 
To obtain the coefficient /3 = VZ'Ytrammg, the columns of V and Z were normalized. The 
results shown in Table [2] demonstrate that regularized PLS outperforms both sparse PGA 
and standard PLS. 

Simulation 1: n = 400, p = 40, SNR = 2 Simulation 2: n = 400, p = 40, SNR = 1 
Method MSPE (SE) Method MSPE (SE) 

"SPCA 2376.2 (337) SPCA 2204.1 (313) 

PLS 2567.7 (316) PLS 2343.3 (281) 

RPLS 404.7 (96) RPLS 339.4 (175) 

Simulation 3: n = 40, p = 80, SNR = 2 Simulation 4: n = 40, p = 80, SNR = 1 

Method MSPE (SE) Method MSPE (SE) 

SPCA 711.3 (107) SPCA 721.7 (109) 

PLS 647.0 (101) PLS 659.9 (81) 

RPLS 142.0 (3) RPLS 133.0 (3) 

Table 2: Comparison of mean squared prediction error (MSPE) with standard errors (SE) 
for multivariate methods. 



5 Case Study: NMR Spectroscopy 

We evaluate the utility of our methods through a case study on NMR spectroscopy data, a 
classic application of PLS methods from the chemometrics literature. We apply our RPLS 
methods to an in vitro one-dimensional NMR data set consisting of 27 samples from five 
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Training Error 


Leave-one-out CV Error 


PCA + LDA 


0.1167 


0.1852 


PLS + LDA (de Jung, 1993) 


0.0000 


0.1481 


GPCA + LDA (Allen et. al, 2011) 


0.1833 


0.1481 


GPLS + LDA 


0.0000 


0.1111 


SPCA + LDA (Shen & Huang, 2008) 


0.1167 


0.1481 


SPLS + LDA (Chun & Keles, 2010) 


0.0000 


0.1111 


SGPCA + LDA (Allen & Maletic-Savatic, 2011) 


0.1833 


0.1481 


SGPLS + LDA 


0.0000 


0.0741 



Table 3: Mis classification errors for methods applied to the neural cell NMR data. Various 
methods were used to first reduce the dimension with the resulting factors used as predictors 
in linear discriminant analysis. 





Time in Seconds 


Sparse PLS 


(via RPLS) 


1.01 


Sparse PLS 


(R package spls) 


1033.86 


Sparse Non- 


negative GPLS 


28.16 



Table 4: Timings Comparisons. Time in seconds to compute the entire solution path for the 
neural cell NMR data. 



classes of neural cell types: neurons, neural stem cells, microglia, astrocytes, and oglioden- 



drocytes (Manganas et al. , 2007), analyzed by some of the same authors using PCA methods 



m 



Allen and Maletic-Savatic (2011). Data is pre-processed in the manner described in Dunn 



et al. (2005): functional spectra is discretized into bins of size 0.04 ppms yielding a total 



of 2394 variables, spectra for each sample are baseline corrected and normalized to their 
integral, and variables are standardized. For all PLS methods, the response, Y is 27 x 5 and 
coded with indicators inversely proportional to the sample size in each class as described in 



Barker and Rayens (2003). For each method, five PLS or PCA factors were taken and used as 



predictors in linear discriminant analysis to classify the NMR samples. To be consistent, the 
BIG method was used to select any penalty parameters except for the Sparse PLS method 



of Chun and Kele§ (2010) where the default in the R package spls was employed (Chung 



et al. , 2012). The Sparse GPCA and Sparse GPLS methods were applied with non-negativity 



constraints as described in Allen and Maletic-Savatic (2011) and in Section 3 Finally, for 



the generalized methods, the quadratic operator was selected by maximizing the variance 
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Figure 1: Sparse Non-negative Generalized PLS loadings and sample PLS heatmaps for the 
neural cell NMR data. The loadings are superimposed on the mean scaled spectral intensities 
for each class of neural cells. 

explained by the first component; a weighted Laplacian matrix with weights inversely pro- 



portional to the Epanechnikov kernel with a bandwidth of 0.2 ppms was employed (Allen 



et ah, 2011). 



In Table |3| we give the training and leave-one out cross-validation misclassification 
errors for our methods and competing methods on the neural cell NMR data. Notice that 
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Figure 2: The first two PLS and Sparse PLS (Chun and Kele§, 2010) loadings and sample 
PLS heatmaps for the neural cell NMR data. The loadings are superimposed on the mean 
scaled spectral intensities for each class of neural cells. 



our Sparse GPLS method yields the best error rates followed by the Sparse PLS (Chun 



and Kele§, 2010) and our GPLS methods. Additionally, our Sparse GPLS methods are 
significantly faster than competing approaches. In Table [5} the time in seconds to compute 
the entire solution path (51 values of A) is reported. Timing comparisons were done on a 
Intel Xeon X5680 3.33Ghz processor using single-threaded scripts. 

In addition to faster computation and better classification rates, our method's flexibility 



21 



leads to easily interpretable results. We present the Sparse GPLS loadings superimposed on 
the scaled spectra from each neural cell type and sample heatmaps in Figure [T} For compari- 
son, we give the first two PLS loadings in Figure |2] for PLS and Sparse PLS ( [Chun and Kele§ 



2010). The PLS loadings are noisy, and the sample PLS components for PLS and Sparse 



PLS are difficult to interpret as the loadings are both positive and negative. By constraining 
the PLS loadings to be non-negative, the chemical shifts the metabolites indicative of each 
neural cell type are readily apparent with the Sparse Non-negative GPLS loadings. Addi- 
tionally as shown in the sample PLS heatmaps, the neural cell types are well differentiated. 
For example, chemical resonances at l.SOppms ad 3.23ppms characterize Glia (Astrocytes 
and Ogliodendrocytes) and Neurons (PLS 1), resonances at l.lQppms and 3.66ppms char- 
acterize Microglia (PLS 2), resonances at 3.23ppms 2.65ppms characterize Astrocytes (PLS 

3) , resonances at 1.30ppms, 3.02ppms, and 3.55ppms characterize Ogliodendrocytes (PLS 

4) , and resonances at 1.28ppms and 3.23ppms characterize Neural stem cells. Note that 
some of these metabolites were identified by some of the same authors using PGA methods 



m 



Manganas et al. (2007); Allen and Maletic-Savatic (2011). Using our flexible PLS ap- 



proach for supervised dimension reduction, however, gives a much clear metabolic signature 
of each neural cell type. Overall, this case study on NMR spectroscopy data has revealed 
the many strengths of our method as well as identified possible metabolite biomarkers for 
further biological investigation. 



6 Discussion 

We have presented a framework for regularizing partial least squares with convex and order 
one penalties. Additionally, we have shown how this approach can be extend for structured 
data via Generalized PLS and RPLS and extended to incorporate non-negative PLS or RPLS 
loadings. Our approaches directly solve penalized relaxations of the SIMPLS optimization 
criterion. These in turn, have many advantages including computational efficiency, flexible 
modehng, easy interpretation and visualization, better feature selection, and improved pre- 
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dictive accuracy as demonstrated in our simulations and case study on NMR spectroscopy. 

There are many future areas of research related to our methodology. While we have 
briefly discussed the use of our methods for general regression or classification procedures, 
specific investigation of the RPLS factors as predictors in the generalized linear model frame- 



work (Marx, 1996; Chung and Keles, 2010), the survival analysis framework (Nguyen and 



Rocke, 2002a), and others are needed. Additionally, following the close connection of PLS 



for classification with the classes coded as dummy variables to Fisher's discriminant analysis 



(Barker and Rayens, 2003), our RPLS approach may give an alternative strategy for regular- 
ized linear discriminant analysis. Further development of our novel extensions for GPLS and 



Non-negative PLS is also needed. Finally, Nadler and Coifman (2005) and Chun and Kele§ 



(2010) have shown asymptotic inconsistency of PLS regression methods when the number of 
variables is permitted to grow faster than the sample size. For related PCA methods, a few 



have shown consistency of Sparse PCA in these settings (Johnstone and Lu, 2009 Amini and 



Wainwright, 2009). Proving consistent recovery of the RPLS loadings and the corresponding 



regression or classification coefficients is an open area of future research. 

Finally, we have demonstrated the utility of our methods through a case study on NMR 
spectroscopy data, but there are many other potential applications of our technology. These 
include chemometrics, proteomics, metabolomics, high-throughput genomics, imaging, hy- 
perspectral imaging and neuroimaging. Overall, we have presented a flexible and powerful 
tool for supervised dimension reduction of high-dimensional data with many advantages and 
potential areas of future research and application. An R package and a Matlab toolbox 
named RPLS that implements our methods will be made publicly available. 
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A Proofs 



Proof of Proposition [I]. The proof of this result follows from an argument in Allen et al. (2011 1, but we 



outline this here for completion. The updates for u are straightforward. We show that the sub-gradient 
equations of the penalized regression problem, ^ 1 1 M u — v 1 1 + AP(v), for v* as defined in the stated result 
are equivalent to the KKT conditions of ([T]). The sub-gradient equation of the latter is, Mu— AVP(v*) — 
27* V* = 0, where VP() is the sub-gradient of P{) and 7* is the Lagrange multiplier for the inequality 
constraint with complementary slackness condition, 7*((v*)"^ v* —1) = 0. The sub-gradient of the penalized 
regression problem is M u — v — AVP(v) = 0. Now, since P() is order one, we this sub-gradient is equivalent 
to Mu— — AVP(v) for any c > and for v — cv. Then, taking c — l/||v||2 = I/27* for any v 7^ 0, we 
see that both the complimentary slackness condition is satisfied and the sub-gradients are equivalent. It is 
easy to verify that the pair (0,0) also satisfy the KKT conditions of ([!]). □ 

Proof of Corollary [1} The proof of this fact follows in a straightforward manner from that of Proposition 
[1] as the only feasible solution for u is u* = 1. We are then left with a concave optimization problem, 
maximizBv v-^M— AP(v) subject to v-^v < 1. From the proof of Proposition [l] we have that this opti- 
mization problem is equivalent to the desired result. Since we are left with a concave problem, the global 
optimum is achieved. D 



Proof of Proposition First, define Q to be a matrix square root of Q as in Allen et al. (2011 ). In this 



paper, they showed that Generalized PGA was equivalent to PGA on the matrix X = X Q for projected 
factors V = Q^V. In other words, if X = UDV is the singular value decomposition, then the GPGA solu- 
tion, V can be defined accordingly. Here, we will prove that the multi-factor RPLS problem for X and v^^ is 
equivalent to the stated Generalized RPLS problem ^ for A = 0. The constraint regions are trivially equiv- 
alent so we must show that v^Pfc_iM = Q Pfc_i Q M. The PLS factors, z^, = Xvj. = X Q v^, = zj,, are 
equivalent. Ignoring the normalizing term in the denominator, the columns of the projection weighting ma- 
trix are Rfc = X^z^ = X z^ = R^. Thus, the if' element of R^R = zf X QQ^ X^ z^ = Rf Q Rj 
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as stated. Putting these together, we have vJPi^^_iM = Q(I — Rfe_i(R^_j Q Ri:_i) ^Rj._j)Q X"^Y 
which simpHfies to the desired result. 

Following this, the proof of the first part is a straightforward extension of Theorem 1 and Proposition 1 

in Allen et al. (2011 1. The proof for the second part follows from combining the arguments in Proposition [l] 



and those in the proof of Theorem 2 in Allen et al. (2011 ). □ 
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